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Abstract During the past years there has been an explosion of interest in learning 
methods based on sparsity regularization. In this paper, we discuss a general class 
of such methods, in which the regularizer can be expressed as the composition of 
a convex function co with a linear function. This setting includes several methods 
such the group Lasso, the Fused Lasso, multi-task learning and many more. We 
present a general approach for solving regularization problems of this kind, under 
the assumption that the proximity operator of the function (0 is available. Further- 
more, we comment on the application of this approach to support vector machines, 
a technique pioneered by the groundbreaking work of Vladimir Vapnik. 
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1 Introduction 

In this paper, we address supervised learning methods which are based on the opti- 
mization problem 

mm{f(x)+g(x)}, (1) 

xeR d 

where the function / measures the fit of a vector x (linear predictor) to available 
training data and g is a penalty term or regularizer which encourages certain types 
of solutions. Specifically, we let f(x) = E(y,Ax), where E : W x W — > [0,°°) is an 
error function, y G M s is a vector of measurements and A <E W xd a matrix, whose 
rows are the input vectors. This class of regularization methods arise in machine 
learning, signal processing and statistics and have a wide range of applications. 

Different choices of the error function and the penalty function correspond to 
specific techniques. In this paper, we are interested in solving problem ((TJ when / 
is a strongly smooth convex function (such as the square error E(y, Ax) = \\y— Ax|||) 
and the penalty function g is obtained as the composition of a "simple" function 
with a linear transformation B, that is, 

g{x) = co(Bx) , (2) 

where B is a prescribed m x d matrix and o is a nondifferentiable convex function 
on W 1 . The class of regularizes (f2) includes a variety of methods, depending on the 
choice of the function CO and of matrix B. Our motivation for studying this class of 
penalty functions arises from sparsity-inducing regularization methods which con- 
sider co to be either the l\ norm or a mixed l\-l p norm. When B is the identity ma- 
trix and p = 2, the latter case corresponds to the well-known Group Lasso method 
ll36ll . for which well studied optimization techniques are available. Other choices of 
the matrix B give rise to different kinds of Group Lasso with overlapping groups 
|[T5l[37l . which have proved to be effective in modeling structured sparse regression 
problems. Further examples can be obtained by considering composition with the 
t\ norm, for example this includes the Fused Lasso penalty function BTI and the 
graph prediction problem of ifPJl . 

A common approach to solve many optimization problems of the general form 
(Q]) is via proximal-gradient methods. These are first-order iterative methods, whose 
computational cost per iteration is comparable to gradient descent. In some prob- 
lems in which g has a simple expression, proximal-gradient methods can be com- 
bined with acceleration techniques |22 24 32], to yield significant gains in the num- 
ber of iterations required to reach a certain approximation accuracy of the minimal 
value. The essential step of proximal-gradient methods requires the computation 
of the proximity operator of function g, see Definition [T] below. In certain cases of 
practical importance, this operator admits a closed form, which makes proximal- 
gradient methods appealing to use. However, in the general case (f2]i the proximity 
operator may not be easily computable. 

We describe a general technique to compute the proximity operator of the com- 
posite regularizer (f2) from the solution of a fixed point problem, which depends on 
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the proximity operator of the function CO and the matrix B. This problem can be 
solved by a simple and efficient iterative scheme when the proximity operator of 
CO has a closed form or can be computed in a finite number of steps. When / is a 
strongly smooth function, the above result can be used together with Nesterov's ac- 
celerated method ll22l l24l to provide an efficient first-order method for solving the 
optimization problem (HJ. 

The paper is organized as follows. In Section|2] we review the notion of proximity 
operator, useful facts from fixed point theory and present a convergent algorithm for 
the solution of problem (Q]i when / is quadratic function and then an algorithm 
to solve the associated optimization problem (Q~|i. In Section [3] we discuss some 
examples of composite functions of the form (O which are valuable in applications. 
In SectionHJwe apply our observations to support vector machines and obtained new 
algorithms for the solution of this problem. Finally, Section [5] contains concluding 
remarks. 



2 Fixed Point Algorithms Based on Proximity Operators 

In this section, we present an optimization approach which use fixed point algo- 
rithms for nonsmooth problems of the form (Q3 under the assumption @. We first 
recall some notation and then move on to present an approach to compute the prox- 
imity operator for composite regularizes. 



2.1 Notation and Problem Formulation 

We denote by (-, •) the Euclidean inner product on W 1 and let j| ■ 1 1 2 be the induced 
norm. If v : K — > M, for every x £ W 1 we denote by v(x) the vector {v(xi))f =1 . For 

every p > 1, we define the £ p norm of x as \\x\\ p = (Lf=i \ x i\ p ) p ■ 

As the basic building block of our method, we consider the optimization problem 
(Q]) in the special case when / is a quadratic function and the regularization term g 
is obtained by the composition of a convex function with a linear function. That is, 
we consider the problem 



where x is a given vector in W 1 and Q a positive definite d x d matrix. The devel- 
opment of a convergent method for the solution of this problem requires the well- 
known concepts of proximity operator and subdifferential of a convex function. Let 
us now review some of salient features of these important notions which are needed 
for the analysis of problem <j3j. 

The proximity operator on a Hilbert space was introduced by Moreau in ||20l . 




(3) 
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Definition 1. Let © be a real valued convex function on Mr. The proximity operator 
of CO is defined, for every x G M d by 



The proximity operator is well defined, because the above minimum exists and is 
unique. 

Recall that the subdifferential of CO at x is defined as dd)(x) = {u : u G M d , (y — 
x,u) + a>(x) < Co(y), Vy G W 1 }. The subdifferential is a nonempty compact and 
convex set. Moreover, if CO is differentiable at x then its subdifferential at x consists 
only of the gradient of CO at x. 

The relationship between the proximity operator and the subdifferential of CO 
are essential for algorithmic developments for the solution of (01, J2 [9] [19] |2D . 
Generally the proximity operator is difficult to compute since it is expressed as the 
minimum of a convex optimisation problem. However, the are some rare circum- 
stances where it can obtained explicitly, for examples when co(x) is a multiple of 
the l\ norm of x the proximity operator relates to soft thresholding and moreover a 
related formula allows for the explicit identification of the proximity operator for the 
£2 norm, see, for example, |2][9][T9). Our optimisation problem (0 can be reduced 
to the identification of the proximity operator for the composition function (OoB. 
Although the prox of CO may be readily available, it may still be a computational 
challenge to obtain the prox of CQoB. We consider this essential issue in the next 
section. 



2.2 Computation of a Generalized Proximity Operator with a Fixed 
Point Method 

In this section we consider circumstances in which the proximity operator of co can 
be explicitly computed in a finite number of steps and seek an algorithm for the 
solution of the optimisation problem ([3j. 

As we shall see, the method proposed here applies for any positive definite ma- 
trix Q. This will allow us in a future publication to provide a second order method 
for solving ([T). For the moment, we are content in focusing on (f3]i by providing a 
technique for the evaluation of pmx moB . 

First, we observe that the minimizer y of (0 exists and is unique. Indeed, this 
vector is characterised by the set inclusion 



To make use of this observation, we introduce the affine transformation A : R m — > W 
defined, for fixed x G W 1 , A > 0, at z 6 R" ! by 




(4) 



Qy G x — B T dco(By) . 
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Az:= (I-XBQ- 1 B T )z + BQ- 1 x 

and the nonlinear operator H : W" — > M." 1 

H := (l-pmx^joA. (5) 

The next theorem from |2] is a natural extension of an observation in Ifl9ll , which 
only applies to the case Q = I. 

Theorem 1. If CO is a convex function on W, B € R mxd , x E W l , X is a positive 
number, the operator H is defined as in (01, and y is the minimize r of ^ then 

y = Q-\x-XB T v) 

if and only ifvE W is a fixed point ofH. 

This theorem provides us with a practical tool to solve problem (01 numerically 
by using Picard iteration relative to the nonlinear mapping H. Under an additional 
hypothesis on the matrix BQ~ 1 Q T , the mapping H is non-expansive, see [2|. There- 
fore, Opial's Theorem ll38l allows us to conclude that the Picard iterate converges 
to the solution of ([3), see J3] [19] for a discussion of this issue. Furthermore, under 
additional hypotheses the mapping H is a contraction. In that case, the Picard iterate 
converges linearly. 

We may extend the range of applicability of our observations and provide a fixed 
point proximal-gradient method for solving problem ([T] when the regularizer has the 
form (O and the error / is a strongly smooth convex function, that is, the gradient of 
/, denote by V/, is Lipschitz continuous with constant L. So far, the convergence of 
this extension has yet to be analyzed. The idea behind proximal-gradient methods, 
see |9]|24][32] and references therein, is to update the current estimate of the solution 
x t using the proximity operator of g and the gradient of /. This is equivalent to 
replacing / with its linear approximation around a point which is a function of the 
previous iterates of the algorithm. The simplest instance of this iterative algorithm 
is given in Algorithm lQ] Extensions to acceleration schemes are described in J2). 



Algorithm 1 Proximal-gradient & fixed point algorithm. 

xi <- 

fort=l,2,... do 

Compute x, + i <- prox» oS (x, - jV/(x f )) 
by the Picard process. 

end for 
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2.3 Connection to the forward-backward algorithm 



In this section, we consider the special case Q = I and interpret the Picard iteration 
of H in terms of a forward-backward algorithm in the dual, for a discussion of the 
forward-backward algorithm, see for example []9| 
The Picard iteration is defined as 

v,+i «- (/-proXffl)((/-ABfi T )v,+Bx) (6) 

We first recall the Moreau decomposition, see, for example, [9] and references 
therein, which relates the proximity operators of a lower semicontinuous convex 
function <p : W" — > RU {+°°} and its conjugate, 

/ = proXp + prox r . (7) 

Using equation (0, the iterative step ^ becomes 

Vf+i <-prox/a>\* (v, - (\BB J v, —Bx)) 

which is a forward-backward method. We can further simplify this iteration by in- 
troducing the vector z t Xv, and obtaining the iterative algorithm 

Zf+i <- Aprox^* (^-zt - {BB T z t -Bx 

Using the readily verified formulas 

iprox^oA7 = prox^ a/ 

and 

see, for example, (5), we obtain the equivalent forward-backward iteration 

Zt+\ <- prox AQ ,« (z t - {XBB T z t - XBx)) . 

This method is a forward-backward method of the type considered in (8], Alg. 10.3] 
and solves the minimization problem 



mm 



l [ ^\\B T z-x\\ 2 + co*(z):zeR"^ . 



This minimization problem in turn can be viewed as the dual of the primal problem 

min | i ||m|| 2 - (x, u) + CO(Bu) : u E R d \ (8) 
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by using Fenchel's duality theorem, see, for example, @. Moreover, the primal and 
dual solutions are related through the conditions — B T z = u—x and z £ dco{Bu), the 
first of which implies that x — XB J v equals the solution of the proximity problem 
([8]), that is, equals prox ffloB (x). 



3 Examples of Composite Functions 

In this section, we provide some examples of penalty functions which have appeared 
in the literature that fall within the class of linear composite functions (|2). 

We define for every d £ N, x £ W 1 and J C{l,...,d}, the restriction of the vector 
x to the index set / as x\j = (x, : i £j). Our first example considers the Group Lasso 
penalty function, which is defined as 

fl*sL(*) = i>i/j 2 , (9) 

1=1 

where //< are prescribed subsets of {l,...,d} (also called the "groups") such that 
U* =1 /^ = {1 , . . . ,d}. The standard Group Lasso penalty, see, for example, ll36l . cor- 
responds to the case that the collection of groups {//< : 1 < £ < k} forms a partition 
of the index set {l,...,d}, that is, the groups do not overlap. In this case, the op- 
timization problem for CO = COql decomposes as the sum of separate problems 
and the proximity operator is readily obtained by using the proximity operator of 
the ^2- norm t° eac h group separately. In many cases of interest, however, the groups 
overlap and the proximity operator cannot be easily computed. 

Note that the function (0 is of the form (|2). We let di = \J(\, m = Yle=i dt an d 
define, for every z £ K m , (o(z) = ]£* =1 \\z1W2, where, for every I = l,...,k we let 
U = (zi : Lj=! dj < i < Yfj=\ dj)- Moreover, we choose B T = [Bj, . . . ,Bj], where B e 
is a d( x d matrix defined as 



(B t )ij = 



1 ifj=m 

otherwise 



where for every / C {l,...,d} and / £ { 1 , . . . , \ J | }, we denote by /[/] the ;-th largest 
integer in /. 

The second example concerns the Fused Lasso iFJTI . which considers the penalty 
function x i-> g(x) = Yf!=\ \ x i — x i+i I- This function falls into the class (O. Indeed, 
if we choose co to be the l\ norm and B the first order divided difference matrix 



B = 



1-1 ... 
1-10 



we get back g. The intuition behind the Fused Lasso is that it favors vectors which 
do not vary much across contiguous components. Further extensions of this case 
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may be obtained by choosing B to be the incidence matrix of a graph, leading to the 
penalty £?. \x, —Xj\. This is a setting which is relevant, for example, in online 
learning over graphs lTT3][T4l . 

The next example considers composition with orthogonally invariant (OI) norms. 
Specifically, we choose a symmetric gauge function h, that is, a norm h, which is 
both absolute and invariant under permutations 11351 and define the function co : 
R dx " -> [0,»), at X by the formula CO(X) = h(o(X)), where a(X) G [0,°°) r , r = 
min(d,n) is the vector formed by the singular values of matrix X, in non-increasing 
order. An example of OTnorm are Schatten p-norms, which correspond to the case 
that co is the fp-norm. The next proposition provides a formula for the proximity 
operator of an OTnorm. A proof can be found in (]2]. 

Proposition 1. With the above notation, it holds that 



where X = t/diag((7(X))V T and U and V are the matrices formed by the left and 
right singular vectors ofX, respectively. 

We can compose an OTnorm with a linear transformation B, this time between 
two spaces of matrices, obtaining yet another subclass of penalty functions of the 
form d2J». This setting is relevant in the context of multi-task learning. For example, 
in HI h is chosen to be the trace or nuclearnorm and a specific linear transformation 
which models task relatedness is considered. Specifically, the regulariser is given by 
g(X) = ||(7 (X(I — 7j<?e T )) ||j, where e S M. d is the vector all of whose components 
are equal to one. 



4 Application to Support Vector Machines 

In this section, we turn our attention to the important topic of support vector ma- 
chines (SVMs), which are widely used in data analysis. SVMs were pioneered by 
the fundamental work of Vapnik j6][l0][33] and inspired one of us to begin research 
in machine learning [11 1 1271 1261 . For that we are all very grateful to Vladimir Vapnik 
for his fundamental contributions to machine learning. 

First, we recall the SVM primal and dual optimization problems, |33l . To sim- 
plify the presentation we only consider the linear version of SVMs. A similar treat- 
ment using feature map representations is straightforward and so will not be dis- 
cussed here, although this in a an important extension of practical value. Moreover, 
we only consider SVMs for classification, but our approach can be applied to SVM 
regression and other variants of SVMs which have appeared in the literature. 

The optimisation problem of concern here is given by 



prox hoa (X) = I/diag(prox A (<7(X)))V 



T 




(10) 
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where V*(z) = max(0, 1 — z), z G R, is the hinge loss and C is a positive parameter bal- 
ancing empirical error against margin maximization. We let Xj 6 R d , i G {1, . . . ,m}, 
be the input data and y, G { — 1 , +1 } be the class labels. 

Problem (TToT > can be viewed as a proximity operator computation of the form 
©, with Q = /, x = 0, (0{z) = C£"UV{zi) and B = [y lXl . . .y m JC m ] T . The proximity 
operator of the hinge loss is separable across the coordinates and simple to compute. 
In fact, for any £ G R and /x > it is given by the formula 

prox MV (£) =min(£ + /i,max(C,l)). 
Hence, we can solve problem (TTOb by Picard iteration, namely 

y,+i <- (/-prox») ((/ - lBB J )v t ) (11) 
with A satisfying < X < — , 2 T . , which ensures that the nonlinear mapping is 

^•max [pB } 

strictly contractive. Note that v t G M" 1 and that this iterative scheme may be inter- 
preted as acting on the SVM dual, see Section l2~3l In fact, there is a simple relation 
to the support vector coefficients given by the equation v = |a. Consequently, this 
algorithmic approach is well suited when the sample size m is small compared to the 
dimensionality d. An estimate of the primal solution, if required, can be obtained by 
using the formula w = — XB T v. Also, when d < m the last equation, relating w and 
v, cannot be inverted. Hence, (fTT~b is not useful in this case. 
Recall that the dual problem of (TlOb is given ll33l 

min ji||B T aj| 2 -l T a: «G[0,C] m |. (12) 

This problem can be seen as the computation of a generalized proximity operator of 
the type (f3). To explain what we have in mind we use the notation as the elemen- 
twise product between matrices of the same size (Schur product) and introduce the 
kernel matrix K — [x\ ... x m ] T [x\ ... x m ] . 

Using this terminology, we conclude that problem (fT2t is of the form (O with 
Q = KQyy T , x=l (the vector of all ones), B = I and CO = co c , where fflt(oc) — if 
<X G [0,C]'" and (£>c{a) = +°° otherwise. Furthermore, the proximity operator for co 
is given by the projection on the set [0,C]"\ that is prox ffl( ., (a) = min(C,max(0, a)). 
These observations yield the Picard iteration 

v,+i <- (/-prox fflc ) ((I-X(K- l Qyy T ))v t + (K- l Qyy T )l) (13) 

with < A < 2X^ m (K). This iterative scheme requires that the kernel matrix K is 
invertible, which is frequently the case, for example, in the case of Gaussian kernels. 
Another requirement is that either K~ l has to be precomputed or a linear system 
involving K has to be solved at every iteration, which limits the scalability of this 
scheme to very large samples. In contrast, the iteration dTTJ can always be applied, 
even when K is not invertible. In fact, when K, and equivalently BB T , is invertible 
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then both iterative methods (fTTb . (fl~3T > converge linearly at a rate which depends on 
the condition number of K, see J2] [19] . 

Recall that algorithm ( fTTT > is equivalent to a forward-backward method in the 
dual, see Section [23l Thus, an accelerated variant akin to Nesterov's optimal method 
and FISTA [3) could also be used. However, in the case of an invertible kernel 
matrix, both versions converge linearly ll24l and hence it is not clear whether there 
is any practical advantage from the Nesterov update. Furthermore, algorithm (Q~3) 
could also be modified in a similar way. 

On the other hand, if m > d, we would directly attempt to solve the primal prob- 
lem. In this case, the Nesterov smoothing method can be employed, l23l . An ad- 
vantage of such a method is that it only stores 0(d) variables, even though it needs 
0(md) computations per iteration. The method described above, based on Picard 
iteration, requires min(0(md) 7 0(m 2 )) cost per iteration and stores 0(m) variables. 

Let us finally remark that iterative methods similar to (fTTl or (TT~3T > can be applied 
to £2 regularization problems, other than SVMs, provided that the proximity oper- 
ator of the corresponding loss function is available. Common choices for the loss 
function, other than the hinge loss, are the logistic and square loss functions lead- 
ing to logistic regression and least squares regression, respectively. In particular, in 
these two cases, the primal objective ( TTOb is both smooth and strongly convex and 
hence a linearly convergent gradient descent or accelerated gradient descent method 
can be used l25l . regardless of the conditioning of the kernel matrix. 



5 Conclusion 

We presented a general approach to solve a class of nonsmooth optimization prob- 
lems, whose objective function is given by the sum of a smooth term and a nons- 
mooth term which is obtained by linear function composition. The prototypical ex- 
ample covered by this setting is a linear regression regularization method, in which 
the smooth term is an error term and the nonsmooth term is a regularizer which 
favors certain desired parameter vectors. An important feature of our approach is 
that it can deal with a rich class of regularizers and, as shown numerically in [0, is 
competitive with the state of the art methods. Using these ideas, we also provided 
a fixed-point scheme to solve support vector machines. Although numerical exper- 
iments have yet to be done, we believe this method is simple enough to deserve 
attention by practitioners. 

We believe that the method presented here should be throughly investigated both 
in terms of convergence analysis, where ideas presented in 04l may be valuable, 
and numerical performance with other methods, such as alternate direction of mul- 
tipliers, see, for example, |4|, block coordinate descent, alternate minimization and 
others. Finally, there are several other machine learning problems where ideas pre- 
sented here apply. For example, in that regard we mention multiple kernel learning, 
see for example, ff8l [28l l29l l30l and references therein, some structured sparsity 
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regularizers Ifl6] [T71 and multi-task learning, see, for example |T|[7][T2)- We leave 
these tantalizing issues for future investigation. 
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